ABSTRACT


Techniques are developed for the design of a monitor of a real-time 
multi-computer system that is under heavy loading. The first portion 
relates to the requirements of partitioning to aid in fault recognition 
and diagnostic routines. The dynamic allocation of system time to the 
system tasks and fault monitoring is developed secondly. System 
reconfiguration of the partitioned subsystems restores the system to 
operation at a degraded level until faults are corrected. The paper
discusses a Ship Combat Weapon System as an example of a large scale
multi-computer system monitor.
I. INTRODUCTION


The monitoring of a small computer system is a relatively simple 
task. But as computer systems grow larger and larger, the problem 
becomes much greater. When the monitoring of a large scale Real-Time 
Multi-Computer system is attacked, many problems arise. One must know 
which program is running in which computer and how each program affects 
the mass of data that flows between the many input /output ports of the 
system. When the system is running near capacity in all processors 
(heavily loaded), the problem becomes even greater. There is little 
time left to process data for monitoring. Therefore, it can be seen 
that monitoring must be done under very severe timing and space 
utilization constraints. This thesis provides the necessary tools 
for calculating timing and core space requirements.

The five major portions of a fault monitoring system are 
(1) Partitioning of the total system into small subsystems, (2) Dynamic 
time allocation to better utilize remaining time for fault monitoring 
procedures, (3) Fault recognition techniques, (4) Diagnostic routines, 
(5) System reconfiguration to automatically restore system operation. 
All five of these operations are fully discussed and analyzed so that 
the designer may apply the correct techniques to his system to obtain 
an effective monitor. 

One example of a Real-Time Multi-Computer system is a Ship Combat 
Weapon System. The general aspects of a Ship Combat Weapon System are 
discussed to show the comparisons of a general system to a specific
one. A simulated Ship Combat Weapon System is shown to contain all
the necessary parts of the general system and is analyzed in great
detail. The methods of monitoring this simulated Ship Combat Weapon 
System are analyzed to develop the necessary monitoring techniques. 

A detailed bibliography is presented that covers the area of 
Monitoring of a Real-Time Multi-Computer system. A cross reference of
the methods of monitoring is also given. 

Proper system time organization shows that much of the system
upkeep and maintenance that normally is accomplished during designated
maintenance periods may now be performed on-line with little or no
system degradation.








II. BASIC ELEMENTS OF A FAULT MONITORING SYSTEM 


Fault detection and recognition is the most important maintenance 
function in a Real-Time System. The Fault Recognition program 
basically tests the processing integrity of the system. This program 
requests that the system found faulty or suspected of being faulty
be diagnosed. The purpose of the diagnostic program is to generate
test data to isolate the fault to a reasonably small section within
the subsystem.

Real-time fault detection historically has been done at the circuit
hardware level. In the early stages of development of fault-tolerant
computers, attention was directed towards massive redundancy at the
lowest level - the replication of individual components (resistors,
transistors, etc.). The use of component redundancy has been limited
by design difficulties and by new developments in component technology.
The change from discrete components to integrated circuits has largely
invalidated the assumption of independent component failures. Without
it, the advantages of component redundancy are lost.

The most developed techniques are fault detection by periodic 
diagnosis and the application of parity and similar error codes to 
detect or correct errors in data transmission and storage. The 
periodic diagnosis techniques have progressed from exclusively software 
implementations to software combinations with special-purpose hardware. 


Concurrent diagnosis uses error detecting codes and monitoring circuits. 
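
The parity technique mentioned above can be illustrated with a minimal
sketch. It assumes a single even-parity bit per data word; the function
names are hypothetical and are not taken from any system described in
this thesis.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit for a non-negative integer word."""
    # The bit is 1 when the word holds an odd number of 1 bits, so that
    # the word plus its parity bit always has an even number of 1 bits.
    return bin(word).count("1") % 2


def encode(word: int) -> tuple[int, int]:
    """Pair a word with its parity bit before storage or transmission."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """Return True when the word still agrees with its stored parity bit."""
    return parity_bit(word) == stored_parity
```

A single parity bit detects any odd number of flipped bits but cannot
detect an even number of flips, and it cannot correct either case.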





A. HARDWARE 

Fault detection and diagnosis by hardware have greatly increased 
the sensitivity and selectivity of finding and correcting errors. In 
the early days, fault detection systems utilized registers that read
data at specific times into an output device. Later, specially
built hardware devices set off alarms when an error was detected.
The circuit that produced the fault was located by utilizing a book 
which contained manual diagnosis. As designers progressed, circuits
were designed not only to send interrupts to the computer notifying
it of an error, but also to allow the diagnostic routine access to
the error condition.

To assist in locating faults, the hardware system may be partitioned 
into logical subelements that allow reconfiguration. Accurate timing 
of events requires a real time clock in the system. Transient or
permanent faults may be initially detected by hardware devices but
efficient identification and location of the faults requires software
diagnostic routines. Diagnostic routines are built upon the concept of
detecting faults by executing one instruction at a time. As the
instructions become more complicated, more circuitry (microsteps) is
analyzed for faults. This is repeated until all elements are ensured
to be fault free or a fault is located.

It can be seen that to effectively detect an error in a timely
manner requires hardware circuits. Programmatic access to error
registers allows greater flexibility and speed in diagnosing the
actual fault.





B. SOFTWARE 

Many studies and investigations have been made in the area of
fault-detection and diagnosis by software. Even the earliest computer
systems had diagnostic programs to check the computers for errors.
As computers became more complex, the size of diagnostic routines
increased, as did the time required to write them (in man-years).
By combining fault-detection with diagnostic routines, the total time
to locate an error was reduced. By allowing periodic maintenance
checks to be performed using this combined method, large computer
systems reduced their amount of down time. By combining
fault-detection and diagnostic routines with automatic system
reorganization, the down time may be reduced to a minimum with imposed
system degradation. [4,14,15]

Software must also be partitioned into logical subelements to 
allow for program relocation or reconfiguration. Knowing when to 
reconfigure requires monitoring the most critical data of each logical 
subelement program. When critical data of a program are detected as
being faulty, then the program itself is faulty. 

When the combined technique of fault-detection, diagnostic routines
and automatic system reorganization is used in a large system,
additional problems occur. Finding time to run the required tests
is a problem in a heavily loaded system. If there is barely enough
time to complete the required tasks, how can we allow extra time for
maintenance tests? The answer implies some type of dynamic time
allocation. Another problem is the manner of presenting this detailed
data to the system monitor operator in a timely manner. The operator
must have enough data, but in a short time, to allow him to complete
the action required of him before the total system fails.


Large and complex computer systems increase the demand for fault
monitoring systems. Because of this complexity, man by himself
requires too much time to solve the same problem. The cost of uncorrected
errors is especially severe in large multi-computer systems and in
situations in which a computer controls a very valuable system and is
not readily accessible for human repair. Examples are a real-time
control computer and a spacecraft computer controlling an
interplanetary mission. A second critical requirement for fault monitoring
exists when human lives may be affected by computer errors (e.g.,
military defense systems, high-speed transportation control systems,
or medical systems). The time to repair such complex systems must
be reduced, and fault monitoring is one approach.

The five necessary parts of fault monitoring are: (1) Partitioning,
(2) Dynamic time allocation, (3) Fault recognition, (4) Diagnostic
routines, (5) System reconfiguration. These must be considered in
detail so that the total effect may result in an efficient optimal
system monitor. The following sections describe the necessary
elements of a fault monitoring multi-computer system. Each part of
fault monitoring will first be defined in detail. Secondly, some
explicit uses of these parts will be given for fault monitoring.


A. PARTITIONING 

Partitioning is the process by which a large complicated system
is divided into logical subelements. Each subelement has a specific
function in either hardware or software. By requiring each subelement
to be a logical subdivision, it may then be replaced in case of a
detected failure. [4,10,16] An example is a program subroutine that
is located in faulty computer memory core; it may be relocated to a
fault free core location. Another example is a detected fault in an
input/output (I/O) channel; the monitoring system would reconfigure
the system input/outputs so as to utilize another channel. If a
Central Processing Unit (CPU) failed, its tasks (processing of logical
program subelements) could be allocated to other CPU's in a reduced
operating mode.

The most important requirement for successful partitioning is to
segment the system into logical elements, each having the capability
of being relocated by the system reconfiguration program.
Thus it is required that core memories, for example, be divided into
modules, regardless of the actual computer memory organization. The
hardware items could be partitioned into CPU's, core memory modules and
input/output channels. Thus when one hardware item fails and another
similar hardware unit is free (or only partly utilized), it may be
used immediately by the system monitoring reconfiguration program. In
a similar manner, all computer software programs and subprograms
should be partitioned into logical units of approximately equal core
size so that immediate reconfiguration may be performed. Programs
of unequal size would create the problem of moving all programs in
the computer. If a display device should fail and the display processor
program becomes idle, the system could be reconfigured to use the
teletype (or some other output device) along with the teletype
processing module.





B. DYNAMIC TIME ALLOCATION

Time slicing is the division of the total computer system time 
available. This system time is allocated to all of its component
parts. Time slicing is used to minimize the
time required for fault detection and diagnosis in any one time frame.
During routine system operations when the system is lightly loaded, there
are large blocks of time free in each executive cycle for fault
recognition and diagnostic analysis. As the system becomes more in
demand, the time available for analysis is shortened. In order to
utilize this time more effectively, the fault monitoring program must
dynamically allocate the available time. Thus the program must know
how much time is available for use and thus how much diagnosis may be
performed in this specific cycle. Flags (or some other method) must
be set and the proper bookkeeping performed to ensure that the most
critical data is still monitored and analyzed during the most heavily
loaded period of system utilization. During lull periods, the
monitoring program must also ensure that all components of the
computer system are analyzed so as to ensure complete system
integrity. [12,13]

It follows directly that fault detection routines must be timed
and must be able to operate under flag (executive) control. By the
proper allocation of these routines to the pertinent tasks at hand,
all criteria may be satisfied. Strict timing control of the main
and monitoring program must be performed and must be programmatically
available.

By dynamically controlling the time used for fault monitoring, a
wide range of operational modes may be effectively monitored. These
modes may vary from lightly loaded systems to very heavily loaded
systems. In a lightly loaded system almost all operations are
concentrated towards fault monitoring, diagnostic analysis and maintenance.
Under heavy loading, the time for monitoring is very small. In most
systems, it is zero except for hardware fault monitors. By dynamically
allocating a small segment of time to special purpose fault monitoring
routines, increased reliability may be gained with little system
interference.
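
The allocation idea of this section can be sketched as follows. This is
only an illustration under stated assumptions: each monitoring routine
has a known (timed) cost, critical routines are tried first, and the
remaining routines are taken in order of how long they have waited. The
class, field and routine names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MonitorTask:
    name: str
    cost_ms: float            # measured execution time of the routine
    critical: bool            # should still run under heavy loading
    last_run_cycle: int = -1  # bookkeeping: cycle in which it last ran


def allocate(tasks, cycle_ms, load_ms, cycle_no):
    """Pick the monitoring routines that fit in the time left this cycle."""
    spare = max(0.0, cycle_ms - load_ms)
    # Critical routines first, then those that have waited the longest.
    order = sorted(tasks, key=lambda t: (not t.critical, t.last_run_cycle))
    chosen = []
    for task in order:
        if task.cost_ms <= spare:
            chosen.append(task.name)
            spare -= task.cost_ms
            task.last_run_cycle = cycle_no
    return chosen
```

With a 20 ms cycle and 18.5 ms of system load, only a 1 ms critical
check fits; when the load drops to 5 ms, the longer memory and CPU
tests are scheduled as well.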


C. FAULT RECOGNITION 

Fault detection in digital computers is implemented either by 
periodic or by concurrent diagnosis. The most common current approach 
is periodic diagnosis, which utilizes a diagnostic program stored in
the computer memory. Computation is periodically interrupted
and the diagnostic program is executed. The diagnostic program itself
is vulnerable to faults in the memory system. The cost of diagnosis
consists of: (1) the storage used for the diagnostic program, (2) the
time consumed by its execution, (3) the time needed for repair, (4) the
repeated execution (rollback) of the program segment which was run
after the last diagnosis. [14] Such time and storage costs are very
severe in real-time computing. The alternate diagnosis method is
concurrent diagnosis in which error-detecting codes and monitoring
circuits are employed to indicate the presence of faults.

A distinction must be made in fault detection between transient
and permanent errors. By maintaining a history of detected errors
with no diagnosable faults, a trend of transient errors may be stored.
This trend may be utilized to determine an impending major fault. In
critical locations, special hardware devices must be installed to
detect errors that would remain undetected or unrecoverable by software
diagnosis. For example, the current instruction address, in the
location counter, may be required to locate a fault. If the location
counter is not programmatically available concurrent with the fault
detection, then a special fault location register is required. This
register would automatically copy the contents of the location counter
at the time of any fault. [4,14]
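
The history of transient errors described above might be kept as a
sliding window of recent events. The sketch below is one possible form,
not a method given in the source; the window length and the threshold
are assumed tuning parameters, and the class name is hypothetical.

```python
from collections import deque


class TransientHistory:
    """Sliding-window record of transient errors for one subsystem."""

    def __init__(self, window: int, threshold: int):
        self.window = window        # number of cycles to look back
        self.threshold = threshold  # errors in the window that signal a trend
        self.events = deque()

    def record(self, cycle: int) -> bool:
        """Log a transient error; return True if a trend is indicated."""
        self.events.append(cycle)
        # Discard events that have fallen out of the look-back window.
        while self.events and self.events[0] <= cycle - self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold
```

A True result would be a reason to schedule a fuller diagnosis of the
subsystem before a permanent fault develops.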

Faults may originate from either hardware or software. They may
also be detected by either hardware, software or a combination of both.
A list of all the pertinent errors to be detected is required. From
this list a division must be made between the two types of fault
origin, hardware or software. This decision is influenced by the
method of fault detection. This list is used in the process of
determining partitions. Each fault detection technique used depends
upon many system factors that must be taken into account. How the
system is partitioned affects the grouping of faults. This grouping
of faults is used by the dynamic time allocation routine.


D. DIAGNOSTIC ROUTINES 

The diagnostic program is designed to isolate and specify errors
in main-frame arithmetic and control logic, the various information
transfers, and the various devices and registers. The program is
constructed on the general basis that every command in the machine
repertoire uses a unique set of microinstructions or microsteps. If
one command leads to a correct result and a second command using the
same set plus one leads to an incorrect result, then the failure is
assumed to be in the additional microstep. The detailed diagnosis of
errors requires that the possibilities of control signal failure,
transmission path failure, and register failure be investigated. It is
often difficult to separate these failures.

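The microstep reasoning above can be sketched in a few lines. The
sketch assumes that each test command is described by the set of
microsteps it exercises; microsteps used by any passing command are
cleared, and what remains for a failing command is suspect. The command
and microstep names are hypothetical.

```python
def isolate_microsteps(results):
    """Return the suspect microsteps for each failing command.

    results: list of (command, set_of_microsteps, passed) tuples.
    """
    proven_good = set()
    for _, steps, passed in results:
        if passed:
            proven_good |= set(steps)
    suspects = {}
    for command, steps, passed in results:
        if not passed:
            # Only microsteps never seen in a passing command remain suspect.
            suspects[command] = set(steps) - proven_good
    return suspects
```

If a load command using fetch and memory-read microsteps passes while
an add command using those two plus an adder microstep fails, only the
adder microstep is left as the suspect.
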
For a large scale real-time multi-computer system, fault diagnosis
routines become even more complex and time consuming than the diagnostic
routine just described. To eliminate these problems, a modularized
systems approach must be utilized. Rather than be concerned with a
specific circuit element failing, we concentrate on detecting modular
subsystem errors. Critical data are monitored so as to indicate
failures in any one of our subsystem modules. Upon confirming a
permanent error in a module, the system reconfiguration program is
called. [4,14]


E. SYSTEM RECONFIGURATION

When permanent faults are detected and analyzed in a complex
system, the total system may halt or the system may be reconfigured
to avoid the faulting component (partitioned submodule) and operated at
a reduced level. [4,14,15] As discussed before, halting a valuable
system is not acceptable. Reconfiguration may be manual or automatic.

In the automatic mode, the computer system must maintain a current
configuration list of all submodules and their operational status.
Upon notification of a submodule failing (that is part of the operational
system), reconfiguration is forced. When a submodule is repaired, and
proven operational, the system monitor operator may indicate this fact
to the program and then request reconfiguration. [13] In the manual
mode, the system must be examined manually, the new configuration
determined and then manually implemented.
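
A configuration list of the kind described for the automatic mode
could be sketched as a table of submodules with a kind, a status and an
assigned task. The layout, the status names ("active", "spare",
"failed") and the unit names below are assumptions made for
illustration only.

```python
def reconfigure(config, failed):
    """Mark a submodule as failed and move its task to a spare of the same kind.

    config: dict mapping name -> {"kind": str, "status": str, "task": str or None}
    Returns the name of the replacement, or None if no spare is available.
    """
    unit = config[failed]
    task = unit["task"]
    unit["status"], unit["task"] = "failed", None
    for name, other in config.items():
        if other["kind"] == unit["kind"] and other["status"] == "spare":
            other["status"], other["task"] = "active", task
            return name
    return None  # no spare of this kind: continue at a degraded level
```

When no replacement exists the function returns None, which corresponds
to operating at a reduced level until the fault is repaired.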







The reconfiguration program contains the necessary information to 
logically interconnect all submodules and to relocate computer programs 
when necessary. By displaying the proposed reconfigured computer 
system to the system monitor operator, approval may be given and the 
new system implemented. Automatic reconfiguration could save as much 
as 99 percent of the time required for manual reconfiguration. 

While automatic reconfiguration may appear to complicate the
problem, it is in reality a simplification. To
manually maintain the configuration control of a large multi-computer
system is a large team effort. Many charts, manuals, and switches
must be coordinated with exacting precision. Automation of this task
reduces this problem. The system configuration is maintained up to
date in the computer memory. Switching and logical control of the
input/output ports are controlled by computer subprograms. When
these aids are implemented in a reconfiguration module program, the
time required to change a configuration is reduced to seconds.

Because of the hypercritical nature of the fault monitoring process, 
special precautions must be observed. The fault monitoring program 
can be duplicated in another computer or reside in a special fault 
tolerant computer. These precautions reduce the danger of a fault
occurring during execution of the reconfiguration program.
IV. A SPECIFIC EXAMPLE: A SHIP COMBAT
WEAPON SYSTEM 

One example of a complicated real-time multi-computer system is
a ship combat weapon system. Many computers are utilized to solve
special subsystem tasks, such as target detection and missile firing.
The total computer complex is integrated together by the Combat
Information Center System. In this real-time system, many control
and data processing functions must be performed at extremely high
speeds. Most of the sensory and control devices must be electrically
connected on-line to the system to permit automatic transfer of data.
Delays in transferring data by means of manual off-line handling of
tapes, cards, etc., are not acceptable.

An executive control philosophy was developed for distributing
the various tasks among the computers. Control of these tasks in
each computer is maintained by an Executive Routine. Over-all control
of the multi-computer system is not employed since each computer is
controlled by its own Executive Routine. Subroutines or tasks are
controlled by the Executive Routine by the use of flags or alerts.
Decisions on whether to respond to the flag or alert at any specific
time are determined by the priority of the input and the time
available to do the task. This accurate timing is made possible
by use of an internal real-time clock. When each task is completed,
the flags and alerts are sensed again and the highest priority task
remaining is performed. Both periodic and demand type tasks are
utilized by executive routines.
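
The flag-driven control just described can be sketched as a small
loop. This is a simplified illustration, assuming that each raised flag
carries a priority (lower numbers are more urgent) and an estimated
cost, and that the time left in the cycle is known from the real-time
clock. The task names and fields are hypothetical.

```python
def next_task(pending, time_left_ms):
    """Return the highest priority flagged task that fits in the time left."""
    runnable = [t for t in pending if t["cost_ms"] <= time_left_ms]
    return min(runnable, key=lambda t: t["priority"]) if runnable else None


def run_cycle(flags, cycle_ms):
    """Sense the flags after each task and run the best remaining one."""
    pending = list(flags)
    time_left = cycle_ms
    executed = []
    while (task := next_task(pending, time_left)) is not None:
        pending.remove(task)
        time_left -= task["cost_ms"]
        executed.append(task["name"])
    return executed
```

Tasks that do not fit in the remaining time stay flagged and would be
considered again in the next cycle.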
A Ship Combat Weapon System is an example of a large real-time
multi-computer system in actual use. A detailed study of this system
will be explored for the methodology of Fault Monitoring techniques.
The methods evolved for Fault Monitoring in a Ship Combat Weapon
System will apply towards most large real-time multi-computer systems.


A. PROPERTIES OF A COMBAT WEAPON SYSTEM AND THE DATA INVOLVED 
A typical Ship Combat System consists of many interconnected 
systems (see fig. 1). They are divided into three major areas: 
(1) Input, (2) Processing and (3) Output. All different types of
input devices are analyzed by the input processors and the relevant
data transmitted to the processing section. The processing section
correlates the data from the different sensors. The correlated data is
transferred to the appropriate output device (Guns, Missiles).

Each of these three areas has in turn many components, some of
which are:

1. Inputs
   A. Radar video
   B. Sonar input
   Cen amp
   D. Ship information
2. Processing (Combat System Decisions)
   A. Three processing computers
   B. Five types of display consoles
      (1) Radar
      (2) Sonar
      (3) CIC (Tactical)
      (4) Weapons
      (5) Monitor
3. Outputs
   A. Missile Fire Control System
      (1) Tracking mount
      (2) Missile mount
   B. Gun Fire Control System
      (1) Tracking mount
      (2) Gun mount
   C. Sonar Fire Control System and Torpedo Mount


B. THE SIMULATED COMBAT WEAPON SYSTEM 

The Ship Combat Weapon System described in figure 1 is very
complicated. To analyze this system in detail for Fault Monitoring
purposes is a large task. A representative system will be simulated
instead (see fig. 2). The simulation will contain Input, Processing
and Output sections. By including one system with each function, a
representative but reasonably sized, heavily loaded system may still
be simulated.

The Naval Postgraduate School has the unique computational
facilities of a large system simulation laboratory with three digital
computers and one analog computer: the Xerox Data Systems (XDS) 9300
medium scale digital computer, two Adage Graphic Terminals AGT-10,
and the Comcor CI-5000 analog computer. Each digital computer is
assigned a major task of the Ship Combat Weapon System while the
analog computer (CI-5000) simulates the physical missile mount. The
physical identity of the computers, the system functions and the
computer programs may be identified in figure 3.

The Adage Graphic Terminal computer number one (Adage 1 computer)
contains the Radar, the Ship processor and the Monitor programs.
Adage Graphic Terminal computer number two (Adage 2 computer) contains
the Combat System Decisions program. The XDS-9300 computer contains
the Missile Fire Control System Program and the CI-5000 operates as
the Missile Mount Simulation.

1. Types of Data Involved

Some typical types of data involved in a Ship Combat Weapon
System are shown in table 1. Each of the five sections of table 1
is typical of the data that each system would contain. This data
is utilized by the Fault Monitoring program to analyze the Combat
System for the detection of errors. The fault monitoring program
will in turn pass this data to the Diagnostic program for further
analysis and evaluation.

The following types of data may be sampled for error detection 
and fault location by an indication of large jumps in the data. 

1. Radar azimuth
2. Target range and bearing
3. Gyro position, azimuth, pitch, and roll
4. Speed
5. Intercept point
6. Time to fire
7. Time to go
8. Launcher angle ordered
9. Launcher limits, data sample rate and bearing
10. Missile mount bearing and elevation


Normally, this data would be a continuous stream. 
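
A check for large jumps in such a data stream can be sketched as
below. The maximum step allowed between consecutive samples is an
assumed tuning parameter that would depend on the quantity and on the
sample rate; the function name is hypothetical.

```python
def find_jumps(samples, max_step):
    """Return the indices at which a sample differs from the previous
    sample by more than max_step."""
    return [
        i
        for i in range(1, len(samples))
        if abs(samples[i] - samples[i - 1]) > max_step
    ]
```

This simple form does not handle angular quantities that wrap from
360 degrees back to 0; such data would need the difference to be taken
modulo 360 first.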
Table 1

Types of Data in the Simulated System

Radar Video & Processor
   A. Azimuth
   B. Target Data
      1. Range
      2. Bearing
Ship Information & Processor
   A. Gyro
      1. Position (Lat., Long.)
      2. Azimuth (Heading)
      3. Pitch, Roll
   B. Speed
Combat System Decisions
   A. Computer Status
      1. Memory Available
      2. I/O Channels available
      3. CPU's available
   B. Target Data (Speed, Heading)
   C. Total System Configuration
Missile Fire Control System
   A. Intercept Point
   B. Time to Fire
   C. Time to Go
   D. Target Destruction Evaluation
   E. Launcher Angle Ordered
   F. Launcher Data Sample Rate
   G. Launcher Bearing Limits
Missile Mount
   A. Bearing (0,6,8)
   B. Elevation (J,9,¢)
   C. Operational Status
      1. Errors in Bearing & Elevation
      2. Drift Rates

Another method of error detection is by the analysis of data
outside of some physical limits.
1. Angles greater than 360° or negative angles.
2. Target data outside of radar range or negative range.
3. Pitch greater than ± 20°.
4. Roll greater than ± 90°.
5. Ship speed greater than 100 knots.
6. Acceleration greater than reasonable limits.
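
The limit checks listed above lend themselves to a table-driven
sketch. The field names are hypothetical, and treating speed as a
signed quantity (negative when going astern) is an assumption; the
numeric limits are those given in the list.

```python
# (low, high) limits per monitored quantity; names are illustrative only.
LIMITS = {
    "azimuth_deg": (0.0, 360.0),
    "pitch_deg": (-20.0, 20.0),
    "roll_deg": (-90.0, 90.0),
    "speed_knots": (-100.0, 100.0),
}


def out_of_limits(sample):
    """Return the names of the quantities in sample that violate LIMITS."""
    return sorted(
        name
        for name, value in sample.items()
        if name in LIMITS
        and not (LIMITS[name][0] <= value <= LIMITS[name][1])
    )
```

Quantities without an entry in the table are ignored rather than
flagged, so the table can be extended one item at a time.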

A few general rules may be given to assist in the detection 
of faults. 

1. Monitor data normally assumed to be continuous and smooth. 
When large deviations are detected, an error has occurred. 

2. Monitor physical data and check for data outside physical 
limitations, i.e., ships moving faster than 100 knots. 

3. Send test data to software subroutines and analyze the 
results. 

4. Send test instructions to the computers to check for central 
processing errors. 

The program interaction of the Ship Combat Weapon System is
described in figure 4. Note the generalized use of the I/O package for
data exchange between programs. This general use of a common program
simplifies the automatic reconfiguration program. Since all programs
have the same requirements of input and output, then data handling
is the same. When data from one program looks the same to another
program, then program relocation is simplified. A program may be
moved from one computer to another without a change to the I/O package
program. Data sampling and fault monitoring also become simpler for the
same reason.
2. Partitioning

a. Hardware

The hardware items of the simulated Ship Combat Weapon
System were partitioned in section III into four major sections and
many subsections. The four major computer subsystems have become
the main hardware partitions (see fig. 3). Each digital computer is
composed of subsections normally ascribed to digital computers: Central
Processor Units (CPU), magnetic core, display units and input/output
channels. Some items like Radar, Monitor and input data have been
simulated with software routines for lack of the actual hardware
devices. Because of the similarity of program size and overall program
action, partitioning by subsystems is a reasonable choice. By
monitoring critical data within these subprograms, fault monitoring
of a subsystem becomes simpler than monitoring the system in total. If
any critical data from the radar subsystem is detected as erroneous,
the direct assumption by the diagnostic routine is that the radar
subsystem is at fault.

The analog computer contains a simulation of the Missile
Mount and therefore, the actual gear train and motor systems are
simulated. The logic control and I/O control are hardware accessible
devices.

Note the similarity of the hardware items among the computer
systems. This allows a more direct method for fault detection and
analysis. Since all three digital computers have central processor
units and all have modular computer core memories (or simulated ones),
then they are similar for hardware partitioning. One module of a
computer core memory could be used to replace another that contains
faults. The replacement could be in the same computer or in alternate
computers. Automatic reconfiguration is possible with similar
interchangeable subsystems.

Table 2 describes the critical data points in the Combat 
System used to detect the various hardware errors. With data monitoring 
of these hardware devices, any error may be analyzed to determine the 
actual device at fault. 

b. Software

Programs have been partitioned into the same four major
partitions as the hardware partitioning. Each subsystem function,
such as Radar or Ship Information, is used to separate major partitions
(see fig. 4). The detailed partitions are different tasks within each
subsystem function, such as Input data, Simulation data or Tracking.
Because each partitioned programming task is of approximately the
same magnitude, relocation is greatly simplified. Upon software
program reconfiguration, the first step will be to reload the
subprogram at fault into the same computer as the one it first faulted
in. If this fails, the program will be reloaded into another computer
with available space. Software reconfiguration will be completed
faster this way than by reloading the total system.
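
The two-step reload policy described above can be sketched as follows.
The loader callback, the free-space table and all names are
hypothetical; the sketch only shows the order in which the computers
are tried.

```python
def reload_subprogram(name, size_words, home, free_words, try_load):
    """Reload a faulted subprogram, preferring its original computer.

    free_words: dict mapping computer name -> free core (words).
    try_load:   hypothetical callback (name, computer) -> bool that
                performs the reload and reports whether it succeeded.
    Returns the computer now holding the subprogram, or None.
    """
    # Step 1: reload into the computer in which the fault occurred.
    if try_load(name, home):
        return home
    # Step 2: fall back to another computer with enough free core.
    for computer, free in free_words.items():
        if computer != home and free >= size_words and try_load(name, computer):
            free_words[computer] -= size_words
            return computer
    return None
```

Because the partitioned subprograms are of roughly equal size, the
free-space test in step 2 is expected to succeed on most computers that
have a module free.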

The pertinent software errors for fault detection are 
shown in table 3. By diagnosing any of these errors, the faulting 
subprogram may be easily determined in a short time. The subprograms
may then be relocated by the system reconfiguration program. 

Table 2

Hardware Errors to Detect

Radar
   A. Azimuth
   B. Target Data (range, bearing)
Gyro
   A. Heading
   B. Pitch, Roll
Pit Log
   A. Speed
Memory
   A. Read, Write
I/O Channels
   A. Parity
CPU
   A. Incorrect subroutine answers
Missile Mount
   A. Launcher Angle
   B. Bearing
   C. Elevation
   D. Drift rate

Table 3

Software Error Detection

Ship Information
   A. Intercept Point
   B. Time to Fire
   C. Time to Go
Radar
   A. Target Data - Speed, Heading
Gyro
   A. Latitude
   B. Longitude
   C. Speed

3. Interface Requirements

a. Data Paths

In order to examine the exchange of data in a detailed
manner, the actual electronic interface must be minutely analyzed.
Table 4 describes the interface equipment that is involved in the
system simulation program. Note the two different levels of core
memory accessibility that are detailed for the XDS-9300 computer
(accessible and inaccessible). This is very appropriate since the
modern modular computer system utilizes this technique of memory
protection. Any data in this protected area of memory must first be
accessed from inaccessible core and placed in accessible core.

Six levels of data accessibility are described in table 4 
to account for all possible types of interfaces. Some computer 
systems may only have one or two levels; the more complex systems 
may include all six types. Systems of all complexities will be 
represented by these levels of interfacing. 

Table 5 lists the data paths and gives for each the access 
time, hardware path, interface interference and typical types of data 
that would be retrieved. The data paths are the same as shown in 
table 4. The access times are the actual times required for both the 
software and the hardware. The first time specified is a fixed time; 
the second is the time for each additional access. The column 
labeled "hardware path" describes the actual path the data takes 
through the system. The right arrow (→) shows the path of the data 
from one hardware item to another, as abbreviated in table 4. The 
interface interference describes the interruption that the data access 
causes to the other hardware and software systems. The typical types 
of data retrieved are as described in table 1. 


Table 4 

INTERFACE EQUIPMENT COMPONENTS 

                                           Abbreviation 
1. Adage memory (core)                     AM 
2. Adage I/O program                       AP 
3. Interface box (Adage to 9300)           IB 
4. 9300 memory, accessible (8k-32k)        XMA 
5. 9300 memory, inaccessible (0-8k)        XMI 
6. 9300 I/O program                        XP 
7. Hybrid interface box                    HI 
8. Analog Computer (CI-5000)               AC 

DEPTH OF DATA ACCESS (see table 5) 
(Based on Monitor in Adage 1) 

Directly addressable                       Adage memory (core) 
Indirectly addressable                     Accessible core in 9300 
Programmatically accessible (level 1)      Inaccessible core in 9300 
Programmatically accessible (level 2)      Alternate Adage core 
Programmatically accessible (level 3)      Analog data (DAC-ADC) 
Programmatically accessible (level 4)      Analog data indirect 
                                           (by SCAN system) 

b. A Specific Example 

Figure 5 describes a specific example of retrieving data 
from a specific computer. Note the complex path that is necessary to 
access this data, ten data transfers in all. This is to be expected 
in large real-time systems and must be timed accurately. The data 
must first be requested from the data sampling program located in 
Adage 1. This request passes through the XDS-9300 computer to the 
Adage 2 computer. The Adage 2 computer must then access the requested 
data and pass it back to the Adage 1 computer via the XDS-9300 
computer. While this data path is long, the majority of the time 
required is for program initiations that must be set up (1.95 msec). 
Thereafter, only a short time (30 µsec) is required for each 
additional word retrieved, i.e., 100 words may be transferred in 3 
msec. 
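
The timing quoted above has a fixed-plus-incremental form, which can be sketched as a small calculation. This is a minimal illustration, not the thesis's own code: the function name is hypothetical, and it assumes that every word retrieved costs 30 µsec on top of the 1.95 msec set-up.

```python
def transfer_time_usec(words, setup_usec=1950.0, per_word_usec=30.0):
    """Estimated time to retrieve `words` words over the Figure 5 path.

    Assumption: the set-up cost is paid once per request and each word
    then adds a constant increment.
    """
    if words < 0:
        raise ValueError("words must be non-negative")
    if words == 0:
        return 0.0
    return setup_usec + per_word_usec * words


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, "words:", transfer_time_usec(n), "usec")
```

Under this assumption the set-up cost dominates short requests, which is why retrieving many words in one request is attractive on this path.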
4. Monitor Program Module 

The Monitor Program contains all items for Fault Monitoring 
as discussed. This program is divided into three related segments: 
(1) Program timing analysis and priority setup, (2) Data sampling 
and (3) Human operator interface. Each segment is independent, but 
relies upon completion of related tasks. Upon completion of all tasks, 
the cycle of monitoring the total system is complete. 

a. Timing Analysis 

This subprogram samples the system usage of time to allow 
the most efficient allocation among the required tasks. Any time 
that remains after the required tasks have been completed is surplus 
time. In most systems, this surplus time is not utilized. For example, 
if the timing analysis program detects three milliseconds of time left 
in a ten millisecond executive loop, it allocates as much of the 
three milliseconds to Fault Monitoring as is possible. This time is 
allocated in one of three modes. 

(1) Light loading. When the computer system is 
only lightly loaded (say 30%), many fault monitoring tasks may be 
accomplished. With this amount of time available, many monitoring 
tasks normally accomplished under maintenance down time may be loaded 
by segments into the computer memory from the disc by the monitor 
program. Since the probability of this lightly loaded condition 
occurring for a reasonable amount of time is high, many time slots 
allotted to the fault monitor may be utilized in loading fault 
detection and diagnostic programs for later execution. Execution 
of these programs in addition to those discussed below maintains the 
computer system at its greatest reliability. 


(2) Medium loading. When the computer system is at 
moderate loading (say 60%), little program swapping of monitor routines 
is allowed. All major system functions are monitored and routine 
maintenance tests are performed only periodically, a section at a time. 
By concentrating the monitoring function on detecting errors of major 
system functions, the up time reliability is greatly enhanced by 
ensuring system operation. When an error is detected, a quick system 
reconfiguration reduces the Mean Time to Repair (MTTR) to near zero. 

(3) Heavy loading. When a computer system is heavily 
loaded, every millisecond is needed to maintain the system in operation. 
This is system utilization of about 90 percent. With so little time 
available, most real-time multi-processor systems accomplish no 
software fault monitoring at all. The result is that the smallest error 
can cause the total system to fail. This is a very bad mistake! It 
is in this situation that on-line Fault Monitoring is needed the most. 
By monitoring only the most critical data points and completing this 
task over many time slices, an effective monitoring program can be 
carried out even under heavy loading. Timing analysis is most 
important when there is very little time available. The process of 
dynamic time allocation can be shown to be most effective for the 
heavily loaded system [1, 2]. When a serious permanent fault occurs, 
more time may be utilized for diagnostic routines to accurately locate 
the fault. With the imminent prospect of system failure, the locating 
and correcting of the fault now has highest priority. Only very 
perishable data need be saved so that an operable system may be 
restored on system restart. Therefore, a computer system with adequate 
fault monitoring will have greatly enhanced system reliability even 
during critical periods of heavy loading. 
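
The three loading modes can be sketched as a simple classification of each executive loop. This is only an illustration under stated assumptions: the function names are hypothetical, and the cut-off fractions (0.45 and 0.75) are assumed midpoints between the 30, 60 and 90 percent figures quoted above, not values from the thesis.

```python
def surplus_msec(loop_msec, used_msec):
    """Time left in one executive loop after the required tasks."""
    return max(0.0, loop_msec - used_msec)


def monitoring_mode(loop_msec, used_msec, light_max=0.45, medium_max=0.75):
    """Classify the current load as 'light', 'medium' or 'heavy'.

    Assumption: the thresholds are illustrative cut-offs chosen between
    the example loads given in the text.
    """
    load = used_msec / loop_msec
    if load <= light_max:
        return "light"
    if load <= medium_max:
        return "medium"
    return "heavy"


if __name__ == "__main__":
    # Ten millisecond loop with seven milliseconds of required work.
    print(surplus_msec(10.0, 7.0), monitoring_mode(10.0, 7.0))
```

In a real monitor the mode would select which monitoring tasks are eligible for the surplus time: loading diagnostics from disc in the light mode, major-function checks in the medium mode, and only the most critical data points in the heavy mode.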


b. Data Sampling 

The data that is needed for fault monitoring is sampled 
through the data paths and stored efficiently in core or disc. 
Redundant data is filtered and only data permutations are actually 
stored. This process requires an intricate scheme for storing data 
since the core space and the time available are both critical. For 
example, the azimuth angular rotation of the missile mount is 
considered to be continuous. Rather than store all data points over 
several seconds (about 500 points), only one data point need be saved. 
This would be the "old" azimuth angle and would be compared to the 
newly acquired angle. The difference would then be compared against a 
maximum allowed difference. Whenever this maximum difference was 
exceeded, an error would be generated. Thus only three words ("old" 
angle, "new" angle, and maximum difference) must be stored, compared 
to possibly 500. 
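
The three-word check described above can be sketched as follows. The class name is hypothetical, and the handling of the 0°/360° wrap-around is an added assumption; the text specifies only the comparison of the old and new angles against a maximum difference.

```python
class AzimuthRateCheck:
    """Keeps only the previous angle and a maximum allowed difference."""

    def __init__(self, max_diff_deg):
        self.max_diff_deg = max_diff_deg
        self.old_deg = None  # no sample seen yet

    def sample(self, new_deg):
        """Return True if the new angle differs too much from the old one."""
        fault = False
        if self.old_deg is not None:
            diff = abs(new_deg - self.old_deg) % 360.0
            diff = min(diff, 360.0 - diff)  # assumed wrap-around handling
            fault = diff > self.max_diff_deg
        self.old_deg = new_deg  # the new angle becomes the "old" angle
        return fault


if __name__ == "__main__":
    check = AzimuthRateCheck(max_diff_deg=2.0)
    for angle in (10.0, 11.5, 20.0):
        print(angle, check.sample(angle))
```

Only the previous angle and the limit persist between samples, which matches the storage saving claimed in the text.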

All data samples needed for adequate fault monitoring are 
grouped into sections; only those subprograms required are brought 
into core and executed. For example, the data elements (table 1) 
that need to be monitored for fault detection and the related 
extraction times (from table 5) are shown below. 


     Data                                          Access Time 

 1.  Radar azimuth                                 8.1 µsec 
 2.  Radar target range                            8.1 µsec 
 3.  Radar target bearing                          8.1 µsec 
 4.  Ship latitude                                 8.1 µsec 
 5.  Ship longitude                                8.1 µsec 
 6.  Missile intercept point                       370 µsec 
 7.  Missile time to fire                          370 µsec 
 8.  Missile target destruction evaluation         470 µsec 
 9.  Missile mount bearing                         470 µsec 
10.  Missile mount elevation                       470 µsec 


Each data point can be retrieved individually or as a 
group. The access time of an individual data item is different from 
that of a group. The best method to choose is the method that results 
in the smallest average access time per data element. For example, by 
summing the individual access times for data items one through five, we 
obtain 40.5 µseconds. Table 5 shows that if more than one data 
element of this type is accessed, the time of access is 15.6 µseconds 
for each element. Therefore by accessing data items one through five 
as a group, the required access time is 78.0 µseconds. In this example, 
it is more advantageous to access each data item individually than as 
a group. Similarly, summing the individual access times of items six 
through ten, we obtain 1,850 µseconds. Again using table 5, we find 
that these same data elements accessed as a group require only 402 
µseconds. In this example, the data access time of a group is much 
less than that of the same items retrieved individually. 
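
The selection rule above (choose the method with the smallest average access time per element) can be sketched as a small comparison. The function name is hypothetical; the totals passed in are the ones quoted in the text.

```python
def cheaper_access(n_items, individual_total_usec, group_total_usec):
    """Return (method, average time per element) for the cheaper method."""
    if group_total_usec < individual_total_usec:
        return "group", group_total_usec / n_items
    return "individual", individual_total_usec / n_items


if __name__ == "__main__":
    # Items one through five: 40.5 usec individually, 78.0 usec as a group.
    print(cheaper_access(5, 40.5, 78.0))
    # Items six through ten: 1,850 usec individually, 402 usec as a group.
    print(cheaper_access(5, 1850.0, 402.0))
```

The same comparison would be repeated for every candidate grouping when the data sampling module is laid out.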
c. Human Operator Interface 

The human operator interface module accepts the sampled 
data and analyzes the data for faults. Upon detecting a fault, the 
pertinent data is displayed to the operator as an alert. Requests from 
the operator are input into this module and displayed in the proper 
format. All human input actions, such as a light pen hit or a function 
switch depression, are recognized by this program module and acted upon. 

If a reconfiguration is requested by the system monitor 
operator, a system study is conducted by the reconfiguration program 
module to analyze the current configuration. Then the program looks 
up the entry in the reconfiguration table appropriate to the component 
which has failed and displays the recommended reconfiguration for 
operator approval (see fig. 6). If the operator approves this 
reconfiguration, he presses a function switch labeled "accept" and 
the program continues and executes the recommended reconfiguration. 
If the system monitor operator disapproves of the recommended 
reconfiguration, he may alter the display in an appropriate manner and 
then order the computer to execute this new configuration. 

5. Combat Information Center Program (SMULA) 

A simulation of a Ship Combat Weapon System was programmed 
on the two Adage Graphic Terminals available at the Naval Postgraduate 
School (see Programs A and B). Three systems were simulated: (1) Combat 
Information Center, (2) Radar and (3) Ship Information. 

The basic purpose of this simulation was to provide a model 
on which to test the ideas presented in the preceding section for a 
fault monitoring system. The simulation provides an actual model of 
combat between a ship and an aircraft. A display of the position of 
the ship and the airplane has been incorporated to allow visual 
following of the action. The simulation is programmed for the 
airplane to approach and attack the ship, firing missiles at the ship 
when close enough. The ship in turn must detect the airplane, and 
make the decisions of hostility, of time to fire and of target 
destruction. 

a. Main Control and Combat Information Display (CICP) 

This program operates in one Adage Graphic Display 
terminal (see Program A). The timing of the overall simulation is 
controlled in this module. 

The command and control system's purpose is to accept data 
from the radar simulation and make a decision upon the identity of 
the target. If it is identified as hostile, a "kill" order is sent 
to the Fire Control Computer System Module. Since the time required 
for the decision process has a Poisson Distribution, it is simulated 
by a constant eight second delay plus a random delay from an 
exponential random number generator with an expectation of four 
seconds. 
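
The delay model just described (a constant eight seconds plus an exponentially distributed delay with a four second mean) can be sketched as follows. This is a minimal illustration using Python's standard random module, not the Adage program itself; the function name is hypothetical.

```python
import random


def decision_delay_sec(rng, fixed_sec=8.0, mean_random_sec=4.0):
    """One simulated decision time: fixed part plus exponential part."""
    # expovariate takes the rate, i.e. 1 / mean.
    return fixed_sec + rng.expovariate(1.0 / mean_random_sec)


if __name__ == "__main__":
    rng = random.Random(1)  # seeded only to make the run repeatable
    samples = [decision_delay_sec(rng) for _ in range(10000)]
    print("shortest:", min(samples))
    print("average: ", sum(samples) / len(samples))
```

With this model no decision is ever faster than eight seconds, and the average decision time is about twelve seconds.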

As the fire control module rotates the missile launcher, 
the missile launcher displayed on the ship moves in synchronism with 
the missile launcher simulation on the Comcor analog computer. After 
lock-on to the target, the combat system "launches" a salvo of two 
missiles. These are simulated on the Adage display as a pair of 
bright dots, one after the other, originating from the missile launcher 
and moving towards the target. If the missiles "hit" the target, the 
target explodes into many bits, simulated by many dots randomly spaced. 
If the plane launches a missile and "hits" the ship, the ship explodes 
in the same way. All missiles are simulated as nuclear type. 

b. Radar and Ship Information Simulator (RADAR) 

This program simulates the radar and ship information 
systems on the ship. It is located in one Adage Graphic Terminal. (see 
Program B) 

The radar's purpose is to detect all incoming targets as 
soon as possible and relay information on range and bearing to the 
Combat System. For the simulated radar, a maximum range of one hundred 
miles was chosen. The delay between the time that the approaching 
aircraft crosses the point of maximum range and the time that the 
target data is actually transmitted has a uniform random distribution 
with a maximum of eight seconds and a minimum of four seconds. Eight 
seconds was chosen after considering the antenna rotational speed and 
the number of rotations required for the radar operator to confirm 
an actual target. 

The Ship Information subprogram simulates the ship as 
moving on a steady course at a speed of thirty knots. This information 
is passed to the CICP program and is used to move the simulated ship 
display. 

The radar simulator has a model of a simulated airplane. 
The airplane model moves at a speed of 3,600 knots. The airplane 
starts on a course of 270°. When it is within 100 miles of the 
ship, it turns to automatically attack the ship. Manual override 
controls are provided to control the course, altitude and missile 
firing. These are controlled by function switches adjacent to the 
display. 
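
The effect of the radar reporting delay on the first reported range can be sketched with a short calculation. The function name is hypothetical, and the sketch rests on two assumptions not stated in the text: that the miles quoted are nautical miles, and that the airplane closes directly on the ship at its full speed during the delay.

```python
import random


def first_report_range_miles(rng, max_range=100.0, speed_knots=3600.0,
                             min_delay_sec=4.0, max_delay_sec=8.0):
    """Range at which target data is first transmitted to the Combat System."""
    delay_sec = rng.uniform(min_delay_sec, max_delay_sec)
    miles_per_sec = speed_knots / 3600.0  # 3,600 knots is one mile per second
    return max_range - miles_per_sec * delay_sec


if __name__ == "__main__":
    rng = random.Random(7)  # seeded only to make the run repeatable
    print([round(first_report_range_miles(rng), 1) for _ in range(5)])
```

Under these assumptions the target has closed by four to eight miles before the Combat System receives its first report.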

6. Fire Control System Program (Missile Mount) 

A simulation of a Digital Fire Control System and of a Missile 
Launching Mount was programmed on the Xerox Data Systems (XDS) 9300 
digital computer and the Comcor CI-5000 analog computer. The Digital 
Fire Control System was simulated on the XDS-9300 computer (see Program 
C). The Missile Launching Mount was simulated on the CI-5000 analog 
computer (see Program D). 

The basic purpose of these simulations was to provide a Digital 
Fire Control System and a Missile Mount to interact with the simulated Ship 
Combat Weapon System on the Adage Graphic Terminals. The Digital Fire 
Control System was written in FORTRAN IV on the XDS-9300 computer and 
uses its hybrid capabilities to communicate with the CI-5000 analog 
computer. The Missile Mount is simulated to act like a real 
mechanical-electrical missile mount. Upon assignment of a target 
azimuth, the missile mount moves as a missile mount aboard ship would 
move. 





a. Digital Fire Control System 

The digital fire control system accepts target information 
from the Combat System and converts this rectangular coordinate data 
to polar coordinates for the fire control missile mount. It must then 
order the missile mount to move from its present azimuth to the target 
azimuth. In a total analog system, this would be all that would be 
required; the analog feedback system would effect the required movement. 
In a digital system many improvements may be gained. Overshoot and 
time to rotate can be minimized under digital control. The digital 
control utilizes a modified "Bang-Bang" approach that uses six phases. 
Each phase implements a separate portion of the task of moving the 
missile mount. With appropriate programming, the missile mount moves 
at the fastest speed possible with the smallest overshoot. Digital 
Control optimizes the control of this simulated large and massive 
Missile Mount as it does in the real case. 
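
The first two steps above, the coordinate conversion and the choice of rotation, can be sketched as follows. This is an illustration only and does not reproduce the six-phase control program. The function names are hypothetical, and the axis convention (x east, y north, azimuth measured clockwise from north in degrees) is an assumption.

```python
import math


def to_polar(x, y):
    """Convert rectangular target data to (range, azimuth in degrees)."""
    target_range = math.hypot(x, y)
    azimuth_deg = math.degrees(math.atan2(x, y)) % 360.0  # clockwise from north
    return target_range, azimuth_deg


def azimuth_error_deg(present_deg, target_deg):
    """Signed shortest rotation from present to target, in [-180, 180)."""
    return (target_deg - present_deg + 180.0) % 360.0 - 180.0


if __name__ == "__main__":
    rng_miles, az = to_polar(10.0, 0.0)
    print(rng_miles, az)
    print(azimuth_error_deg(350.0, 10.0))
```

The sign of the error tells a controller which way to drive the mount, and its magnitude is the quantity a bang-bang scheme would use to decide when to switch from acceleration to braking.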

b. Missile Launcher Mount 

The missile mount was simulated on the CI-5000 analog 
computer using hybrid computer techniques. The simulated mount consists 
of an amplidyne controlled generator that drives a large motor which in 
turn drives a 100:1 gear train connected to the missile mount. The 
amplidyne requires 34 volts per field ampere and in turn controls the 
field coil of the generator, which can produce 25 amperes at 440 volts. 
The generator drives a 200 horsepower motor at speeds up to 1150 RPM. 
The mount weighs 28 tons and may rotate at a rate of up to 
one radian per second. The resulting transfer and analog computer 
equations of the fourth order system are: 
Lo 85g = 112 65 ~3030 5 + 3635 (104) e. 
II = ~05333 Lp + -0333 Bag 
HII 1, = 87 1,,- 7.67 0, 


IV 96 «= ,032 I. - 20909 oe 


m 
fee - = Kk (CO. - 6, ). 
where   g - generator     m - motor 
        l - launcher      f - field 
        a - armature      K - constant 


Both azimuth and elevation controls are implemented on the 
analog computer and have been verified to be similar to a shipboard 
missile mount. The analog simulation adds realism to the combat 
system and allows actual hardware items to be monitored by the fault 
monitoring system. 







V. RECOMMENDED TECHNIQUES 


By analyzing the process used in fault monitoring in the simulated 
Ship Combat Weapon System, the overall technique should now be clear. 
By applying the following techniques, a Real-Time Multi-Computer 
Monitoring system may be designed to operate effectively even under 
heavy loading conditions. 


A. DYNAMIC TIME ALLOCATION 

The hardware and software data transfer rate between all hardware 
components must be accurately determined by an interface timing study. 
This may be accomplished by writing a simple program loop passing data 
between the components. All critical data (critical to the hardware and 
software partitions) must be determined and listed as either hardware or 
software accessible data needed for data evaluation. From the interface 
timing study, the time required to access this data may be determined. 
From this data list, groups of data should be determined so as to best 
fit the minimum time allotted to fault detection under heavily loaded 
operations. This grouping must be done in conjunction with the study 
of Partitioning. When the final list is completed and all groupings 
made, this data becomes the basis of the data sampling program module. 

The timing analysis program module works directly with the data 
sampling program. By analyzing the system resources, time allocation 
may be distributed to a number of data sampling group subprograms and 
data analysis programs. For example, two milliseconds may be allocated 


to monitor all critical data points of the radar system. 
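
The allocation step can be sketched as fitting data sampling groups into the time budget in order of criticality. This is a minimal greedy illustration under stated assumptions, not the thesis's timing analysis module: the function name, the priority numbers and the "diagnostic segment" entry are hypothetical, while the 40.5 and 402 µsec costs are the group access times quoted earlier.

```python
def allocate(groups, budget_usec):
    """Pick sampling groups that fit the budget, most critical first.

    groups: list of (name, priority, cost_usec); a lower priority number
    means a more critical group (an assumed convention).
    Returns (names chosen, time left over).
    """
    chosen = []
    remaining = budget_usec
    for name, _priority, cost_usec in sorted(groups, key=lambda g: g[1]):
        if cost_usec <= remaining:
            chosen.append(name)
            remaining -= cost_usec
    return chosen, remaining


if __name__ == "__main__":
    groups = [
        ("diagnostic segment", 3, 1800.0),   # hypothetical cost
        ("radar and ship data", 1, 40.5),
        ("missile data", 2, 402.0),
    ]
    print(allocate(groups, 2000.0))  # a two millisecond allocation
```

A group that does not fit is simply deferred to a later time slice, which is consistent with completing the monitoring task over many slices under heavy loading.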
B. PARTITIONING 

The multi-computer system must be partitioned into hardware and 
software logical subelements. By determining the degree of 
reconfiguration possible, e.g., the number of CPUs, the degree of 
partitioning becomes known in part because partitioning and 
reconfiguration are interrelated. Both must be determined together in 
order to optimize system resources. If two partitioned elements may not 
be used interchangeably for reconfiguration, then the partitioning is 
too small. Partitioning must also consider what data inside a proposed 
subelement is critical. Normally each logical element of a multi-computer 
system has a number of data elements that can be used to determine when 
the logical element has failed. These data elements are the critical 
data points of this partitioned logical subelement. 

If a computer program is critical to the operation of a 
multi-computer system and has no replacement, then a simulation of the 
program should be included in the system. A simulation of a program 
may be a smaller version of the replaced program or it may be a dummy 
program that allows the total system to remain operational at a 
reduced level. Then a software fault in this program that cannot be 
corrected by the relocation of the program may be corrected temporarily 
by the use of the simulation. While the simulation is maintaining 
the system at a degraded level, the error in the program may be 
corrected and the program then reinstated into the system. 

After the system has been satisfactorily partitioned, the subsystem 
elements become the system status list. The logical connections (I/O) 
of these partitions are then also fixed and inserted into tables in the 
reconfiguration program module. 

C. FAULT RECOGNITION 

Since the critical data has been determined under the study of 
Dynamic Time Allocation, only the method of fault recognition remains. 
Software errors are only recognized by software routines, but hardware 
faults are best detected by a combination of hardware and software. 
If hardware devices are present to detect the faults, then they should 
be used in preference to software routines as hardware detection is 
much faster. If some faults require an exorbitant amount of time to 
be recognized in software, then the use of special purpose hardware 
registers and fault detectors should be studied. Special purpose 
hardware fault detectors operate at a much higher speed but may be 
more expensive than software subroutines. 

Permanent faults and errors may be detected and analyzed by hardware 
or software, but transient faults and errors may only be economically 
analyzed by software routines. Provisions for detecting and analyzing 
transient errors must be included in the system. 


D. DIAGNOSTIC ROUTINES 

Diagnostic routines become smaller in fault monitoring programs that 
recognize faults at the subsystem level. Normally a fault may be due 
to any one of hundreds of likely components. All components must be 
diagnosed to determine which component is at fault. Since any sub- 
system may only have from three to five critical data points (for 
example), the diagnostic routine necessary to locate a subsystem error 
may be simpler in design. 

Normal diagnostic routines, used to diagnose computers and special 
hardware devices, are utilized in this fault monitoring system also. 
Since most of these routines require run times of minutes, they must 
be segmented into logical time elements that may be called by the 
system dynamic time allocation routine and executed when the computer 
is lightly loaded. In this way complete diagnostic analysis of the 
total multi-computer system can be accomplished. 
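
Segmenting a long diagnostic into short time elements can be sketched with a Python generator that gives up control after each segment. This is only an illustration of the idea: the diagnostic shown, the block names and the one-segment-per-slot rule are all hypothetical.

```python
def memory_diagnostic(blocks):
    """Hypothetical diagnostic split into one short segment per block."""
    for block in blocks:
        # A real segment would test the block here; each segment is
        # assumed to fit within a single monitoring time slot.
        yield "tested " + block


def run_segments(diagnostic, slots_available):
    """Run at most one segment per available slot; return what was done."""
    done = []
    for _ in range(slots_available):
        try:
            done.append(next(diagnostic))
        except StopIteration:
            break  # the diagnostic has finished
    return done


if __name__ == "__main__":
    diag = memory_diagnostic(["CORE 1a", "CORE 1b", "CORE 1c"])
    print(run_segments(diag, 2))  # a lightly loaded period with two slots
    print(run_segments(diag, 2))  # resumes where it left off
```

Because the generator keeps its own position, the time allocation routine needs to hold only a reference to each partially completed diagnostic between lightly loaded periods.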


E. SYSTEM RECONFIGURATION AND PRESENTATION 

Items A-D above compile all the necessary data for system 
reconfiguration and presentation to the System Monitor Operator. The 
system status list that was generated from the study of Partitioning 
may now be used to determine when and how a system might be reconfigured. 
A set of possible configuration lists, or even a program that computes 
an acceptable reconfiguration, is available for the use of the 
reconfiguration program. When requested, the program checks whether the 
submodule that has failed is on the current subsystem active list 
(see fig. 7 for a sample list). If it is, the program removes it and 
places it on the non-active list. The program then searches the possible 
configuration lists until it finds a match with the current subsystem 
active list. If a match is not found, it notifies the system monitor 
operator and halts. Before continuing, if the operator approval mode is 
set, the reconfiguration is presented to the operator for approval. In 
the automatic mode, this step is omitted. By resetting the logical 
interface list, any faulty input/output ports are bypassed. If needed, a 
program may be relocated (by reloading it). If the fault is programmatic, 
a suitable simulation program may be loaded to replace the faulty one. 
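
The search described above can be sketched with sets. This is a minimal illustration under stated assumptions rather than the reconfiguration program itself: the function and configuration names are hypothetical, a configuration is assumed to "match" when every component it requires is still on the active list, and the operator approval step is modeled as an optional callback.

```python
def reconfigure(active, inactive, failed, configurations, approve=None):
    """Move the failed component off the active list and pick a configuration.

    configurations: list of (name, set of required components), searched
    in order. approve: optional callback for the operator approval mode;
    None models the automatic mode.
    Returns the chosen name, or None if the operator rejects it.
    Raises LookupError when no configuration matches (notify and halt).
    """
    if failed in active:
        active.remove(failed)
        inactive.add(failed)
    for name, required in configurations:
        if required <= active:  # all required components still active
            if approve is not None and not approve(name):
                return None
            return name
    raise LookupError("no acceptable configuration; notify operator")


if __name__ == "__main__":
    active = {"RADAR", "CPU #1", "CPU #2", "DISPLAY 1"}
    inactive = set()
    configurations = [
        ("full", {"RADAR", "CPU #1", "CPU #2", "DISPLAY 1"}),
        ("degraded", {"RADAR", "CPU #1", "DISPLAY 1"}),
    ]
    print(reconfigure(active, inactive, "CPU #2", configurations))
```

Ordering the configuration lists from most to least capable means the first match is also the least degraded configuration still available.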
The type of presentation displayed to the monitor operator depends 
upon the equipment available and the system being monitored. Since 
EXAMPLE OF A SUBSYSTEM ACTIVE LIST 

ACTIVE      COMPONENT (see fig. 3) 

yes         RADAR 
yes         SHIP DATA 
yes         MONITOR 
yes         CPU #1 
yes         CORE 1a 
yes         CORE 1b 
yes         CORE 1c 
no          CORE 1d 
yes         DISPLAY 1 
yes         INTERFACE 1 
yes         INPUT DATA 
yes         OUTPUT DATA 
yes         CPU #2 
yes         CORE 2a 
yes         CORE 2b 
no          CORE 2c 
no          CORE 2d 
no          DISPLAY 2 
yes         INTERFACE 2 
yes         DAC 
yes         ADC 
etc.        etc. 

Figure 7 






the data to be presented is voluminous and time critical, any mechanical 
device would be too slow. Some type of Cathode Ray Tube (CRT) with 
function switches and perhaps a typewriter input is needed. Then the 
fault, its location and recommended solutions can be displayed 
simultaneously. The recommended reconfiguration presentation can be 
either displayed as a logic diagram showing the reconfigured components 
and their links (see fig. 6) or the two lists (old and new 
reconfiguration lists) can be displayed side by side. Because of the 
rapid visual assimilation by the operator of the displayed data, rapid 
decisions may be made. 